Brief overview

The EWAS Catalog contains thousands of associations from hundreds of studies. Among these EWAS, DNA methylation was most commonly measured in blood using the Illumina Infinium HumanMethylation450 BeadChip (HM450 array). This platform assays fewer than 2% of CpG sites in the human genome, and the sites it includes were selected from regions hypothesised to be relevant to gene regulation. Understanding what drives the associations found by measuring DNA methylation in this way could help prioritise CpG sites or regions of the genome to target with future EWAS technologies. It could also guide current EWAS study design, for example by identifying sites that could be removed before analysis.

Data/methods overview

Data were taken from The EWAS Catalog in June 2021. Results and studies that did not meet the EWAS Catalog inclusion criteria were removed, as were studies that compared DNA methylation levels between tissues, races, or ages. This left 2528 EWAS with 386,096 results at \(P < 1 \times 10^{-4}\).

To allow associations between EWAS effect estimates and CpG characteristics to be tested across traits, beta coefficients were standardised to give \(\beta_{standard}\) as follows:

\[\begin{equation} \beta_{standard} = \frac{\beta\sigma(x)} {\sigma(y)} \tag{1} \end{equation}\]

where \(\beta\) = beta coefficient, \(\sigma\) = standard deviation, \(x\) = independent variable, \(y\) = dependent variable. As individual participant data were not available to us, the variance of DNA methylation at each site was approximated by the site-level variance supplied by the Genetics of DNA Methylation Consortium (GoDMC). The trait variance was estimated by rearranging equation (2), with the rearrangement depending on whether DNA methylation was the independent (\(x\)) or dependent (\(y\)) variable in the model.

\[\begin{equation} r^2 = \frac{\beta^2\sigma^2(x)} {\sigma^2(y)} \tag{2} \end{equation}\]
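As a worked illustration of equations (1) and (2), the sketch below rearranges equation (2) to recover the trait standard deviation and then applies equation (1). The function names, the example numbers, and the assumption that an \(r^2\) value is already available for each association are illustrative and are not taken from the analysis code.

```python
import math


def estimate_trait_sd(beta, r2, meth_sd, meth_is_outcome):
    """Rearrange equation (2) to estimate the trait standard deviation.

    meth_sd is the methylation SD at the site (e.g. a GoDMC-style value).
    r2 is assumed to be known for the association (an assumption here).
    """
    if meth_is_outcome:
        # x = trait, y = methylation: sigma(x) = sqrt(r2) * sigma(y) / |beta|
        return math.sqrt(r2) * meth_sd / abs(beta)
    # x = methylation, y = trait: sigma(y) = |beta| * sigma(x) / sqrt(r2)
    return abs(beta) * meth_sd / math.sqrt(r2)


def standardise_beta(beta, sd_x, sd_y):
    """Equation (1): beta_standard = beta * sigma(x) / sigma(y)."""
    return beta * sd_x / sd_y


# Hypothetical association where methylation is the independent variable.
beta, r2, meth_sd = 0.5, 0.04, 0.02
trait_sd = estimate_trait_sd(beta, r2, meth_sd, meth_is_outcome=False)
beta_standard = standardise_beta(beta, sd_x=meth_sd, sd_y=trait_sd)
print(round(trait_sd, 4), round(beta_standard, 4))
```

Under these two equations the magnitude of the standardised coefficient equals \(\sqrt{r^2}\) whichever role DNA methylation plays in the model, which is a quick sanity check on any implementation.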

GoDMC also provided the mean levels of DNA methylation at each site. Heritability of DNA methylation at each site has previously been estimated by McRae et al. 2014 and Van Dongen et al. 2016. These values were kindly made publicly available by the authors of those studies. In this chapter, the estimates of heritability from twin data (Van Dongen et al. 2016) were used.

LOLA, along with data from Roadmap Epigenomics and ENCODE, was used to assess the enrichment of DMPs in transcription factor binding sites and chromatin states.

The results are split into the following sections:

  • Description of studies and results
  • Examination of covariates and potentially faulty probes
  • Assessment of replication across EWAS
  • Association of variance and \(h^2\) at each CpG site with effect size and replication
  • Enrichment of DMPs across genomic regions

Description of studies and results

Below are tables and figures describing the data.

Table 1: EWAS Catalog studies overview
Characteristic               Value
Number of EWAS               2108
Unique traits                1995
Number of samples            6353039
Median sample size (range)   4,170 (93 - 17010)
Number of associations       205113
Unique CpGs identified       145730
Unique genes identified      19737
Sex (%)                      Both (78.7), Females (18.4), Males (1.9)
Ancestries (%)               Unclear (52.65), EUR (38.03), AFR (3.11), Other (2.54), ADM (2.23), EAS (0.83), SAS (0.61)
Age (%)                      Adults (84.1), Children (5.7), Infants (5.2), Geriatrics (4.0)
Number of tissue types       53
Most common tissues (%)      whole blood (88.21), cord blood (4.39), placenta (1.11), cd4+ t-cells (0.95), saliva (0.83)

Figure 1: The 10 most common trait names and EFO term labels.

Figure 2: EWAS sample characteristics

Figure 3: Number of unique traits associated with DNA methylation at each CpG. Sites associated with more than 10 unique traits are highlighted in orange and labelled.

Figure 4: Distribution of \(r^2\) values across all CpG sites in The EWAS Catalog

Figure 5: Distribution of the sum of \(r^2\) values across each study in The EWAS Catalog.

Examination of covariates

Each study may have reported results across multiple EWAS models, adjusting for different covariates. In at least one model, 1863 studies adjusted for batch effects, 1983 studies adjusted for cell composition, and 1777 adjusted for both. Of all DMPs identified, 10% were measured by potentially faulty probes and an extra 0.46% were present on sex chromosomes (Figure 6).

Figure 6: The percentage of DMPs that may have been identified by faulty probes and the percentage of EWAS that reported identifying at least one of these probes. The left-hand bar represents all DMPs reported across all EWAS that fit into the categories shown; the right-hand bar represents the number of EWAS that include CpGs that fit into the categories shown. Some CpGs are both on a sex chromosome and were identified as faulty by Zhou et al.; these were labelled as ‘potentially faulty’.

Assessment of replication

There were 39 studies that performed a meta-analysis of discovery and replication samples. A further 71 studies performed a separate replication analysis. Together, this provides 2770 associations within the EWAS Catalog that have been replicated at \(P < 1 \times 10^{-4}\).

Across the re-analysed GEO studies, between 0% and 96.88% of DMPs were replicated at \(P < 1 \times 10^{-4}\) (Table 2). Some of these EWAS reported very few DMPs (some only one), and because the re-analyses would have used different models from the original studies, replicating a single reported result was not expected.

Table 2: GEO studies re-analysis
Trait                             N-DMPs   N-replicated   Percent-replicated
Age at menarche                   1        0              0.00
Arsenic exposure                  12       0              0.00
Fetal alcohol spectrum disorder   19       1              5.26
Inflammatory bowel disease        14       13             92.86
Nevus count                       1        0              0.00
Psoriasis                         16       0              0.00
Rheumatoid arthritis              47,875   116            0.24
Smoking                           32       31             96.88
Smoking                           30       12             40.00

Replication could also be assessed using EWAS results for the same trait from different studies. Figure 7 shows the process of preparing and selecting data for examining replication. Figure 8 shows the number of EWAS per trait (for traits with >1 EWAS). Figures 9 to 13 show the replication across the 5 traits selected. For the heatmaps, each row and column represents an EWAS with the code “STUDY-NUM_ARRAY_TISSUE”; for example, “1_HM450_Who” is the first study, in which DNAm was measured in whole blood using the 450k array. The heatmaps should be read row by row. For each row, the results of the EWAS were restricted to those at \(P < 1 \times 10^{-7}\) and the CpG sites were extracted. These CpGs were then looked up in the other EWAS of the same trait, where only the catalog’s own threshold (\(P < 1 \times 10^{-4}\)) applied rather than the stricter \(P < 1 \times 10^{-7}\).
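The row-by-row look-up can be sketched as below. This is a minimal illustration, assuming each EWAS is held as a mapping from CpG ID to P value and that each heatmap cell holds the proportion of a row's CpGs found in the column EWAS; the function name, the data layout, and the toy CpG IDs are hypothetical.

```python
DISCOVERY_P = 1e-7   # threshold applied to the row EWAS
LOOKUP_P = 1e-4      # catalog-wide threshold applied to the column EWAS


def replication_matrix(ewas_results):
    """ewas_results maps an EWAS label to {cpg_id: p_value}.

    Returns {row_label: {col_label: proportion or None}}, where each value is
    the share of the row EWAS's P < 1e-7 CpGs that also appear in the column
    EWAS at P < 1e-4. None marks rows with no CpGs passing the threshold.
    """
    matrix = {}
    for row, row_res in ewas_results.items():
        hits = {cpg for cpg, p in row_res.items() if p < DISCOVERY_P}
        matrix[row] = {}
        for col, col_res in ewas_results.items():
            if col == row:
                continue
            if not hits:
                matrix[row][col] = None
                continue
            # CpGs absent from the column EWAS count as not replicated.
            found = sum(1 for cpg in hits if col_res.get(cpg, 1.0) < LOOKUP_P)
            matrix[row][col] = found / len(hits)
    return matrix


# Hypothetical results for two EWAS of the same trait.
toy = {
    "1_HM450_Who": {"cg01": 1e-9, "cg02": 5e-8, "cg03": 2e-5},
    "2_HM450_Who": {"cg01": 3e-6, "cg03": 1e-10},
}
print(replication_matrix(toy))
```

Because the two thresholds differ, the matrix is not symmetric: a site can count as replicated when read from one row but not from the other, which is why the heatmaps are read row by row.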

Figure 7: Preparing EWAS results for replication analyses. The number of EWAS that a trait required in order to be included in the analyses was chosen by examining Figure 8.

Figure 8: Number of EWAS per trait

Figure 9: BMI heatmap

Figure 10: Alcohol consumption heatmap

Figure 11: Birthweight heatmap

Figure 12: Smoking heatmap

Figure 13: Maternal smoking during pregnancy heatmap

DNA methylation characteristics

Before assessing which CpG characteristics might, in part, explain some of the associations found in EWAS, sites measured by potentially faulty probes or located on either of the sex chromosomes were removed. Further, studies that did not include batch effects and cell composition as covariates in at least one EWAS model were removed. Overall, this left 1918 EWAS and 151322 associations (at \(P < 1 \times 10^{-4}\)).
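These exclusions amount to three filters applied to each association. The sketch below shows one way to express them, assuming each association is a record with hypothetical keys `cpg`, `chr` and `study`, that sex chromosomes are labelled "X" and "Y", and that a study qualifies only if a single model adjusted for both batch effects and cell composition (the text does not state whether both must appear in the same model).

```python
def filter_associations(associations, faulty_probes, adjusted_studies):
    """Keep associations that pass the three exclusions.

    associations: list of dicts with hypothetical keys 'cpg', 'chr', 'study'.
    faulty_probes: set of CpG IDs flagged as potentially faulty.
    adjusted_studies: set of study IDs with at least one model adjusting for
        both batch effects and cell composition (an assumed reading).
    """
    sex_chromosomes = {"X", "Y"}
    return [
        a for a in associations
        if a["cpg"] not in faulty_probes
        and a["chr"] not in sex_chromosomes
        and a["study"] in adjusted_studies
    ]


# Hypothetical input: only cg01 passes all three filters.
assoc = [
    {"cpg": "cg01", "chr": "1", "study": "s1"},
    {"cpg": "cg02", "chr": "X", "study": "s1"},
    {"cpg": "cg03", "chr": "2", "study": "s1"},
    {"cpg": "cg04", "chr": "3", "study": "s2"},
]
kept = filter_associations(assoc, faulty_probes={"cg03"}, adjusted_studies={"s1"})
print([a["cpg"] for a in kept])
```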

Figure 14: Distribution of beta values and DNA methylation levels after various transformations. The distribution of DNAm levels comes from the mean methylation levels of CpG sites across the GoDMC cohorts.

Figure 15: Associations of both \(h^2\) and DNAm variance with effect size. Variance and \(h^2\) were taken from GoDMC data.

Figure 16: Model performance. Testing the performance of the model: beta ~ variance + \(h^2\)

Figure 17: Differences in \(h^2\) and variance between CpGs that have replicated and those that have not. \(h^2\) and variance were taken from GoDMC data. kw-p = P value from a Kruskal-Wallis test. med-diff = difference between medians.

Figure 18: Predicting whether a CpG will be a DMP using \(h^2\) and variance. ROC curves from DMP ~ \(h^2\) + variance, DMP ~ \(h^2\), DMP ~ variance

Enrichment of DMPs across genomic regions

Five different groups of DMPs were defined for the enrichment analyses:

  • Group A - all sites associated with any complex trait at the conventional P-value threshold used in EWAS, \(P < 1 \times 10^{-7}\).
  • Group B - a subset of group A, all sites associated with any complex trait at a more stringent threshold, \(P < 5.2 \times 10^{-11}\). Multiple EWAS were conducted to produce the results in the database, so the stricter threshold of group B aimed to limit the false discovery rate by taking the multiple EWAS into account.
  • Group C - DMPs replicated at \(P < 1 \times 10^{-4}\) in any other EWAS of the same trait.
  • Group D - a subset of group A, but restricted to results from studies where DNA methylation was measured in whole blood.
  • Group E - a subset of group B, but restricted to results from studies where DNA methylation was measured in whole blood.
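The group definitions can be written as a small classifier. In the sketch below the function name, its arguments, and the precomputed `replicated` flag are illustrative assumptions. The text does not say how the group B threshold was derived; dividing \(1 \times 10^{-7}\) by the 1918 EWAS retained after filtering gives a value close to \(5.2 \times 10^{-11}\), which is consistent with a Bonferroni-style correction, but that reading is an inference and not a documented step.

```python
GROUP_A_P = 1e-7      # conventional EWAS threshold
GROUP_B_P = 5.2e-11   # stricter threshold accounting for multiple EWAS


def dmp_groups(p_value, tissue, replicated):
    """Return the set of group labels (A-E) a single association falls into.

    'replicated' is a precomputed flag: True if the site was found at
    P < 1e-4 in another EWAS of the same trait (group C).
    """
    groups = set()
    in_blood = tissue.lower() == "whole blood"
    if p_value < GROUP_A_P:
        groups.add("A")
        if in_blood:
            groups.add("D")
    if p_value < GROUP_B_P:
        groups.add("B")
        if in_blood:
            groups.add("E")
    if replicated:
        groups.add("C")
    return groups


print(sorted(dmp_groups(3e-12, "whole blood", replicated=True)))
print(sorted(dmp_groups(4e-9, "placenta", replicated=False)))
print(1e-7 / 1918)  # compare with the group B threshold
```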

Enrichment analyses require a background set of CpGs against which the DMPs of interest are tested. Figure 19 is a representative plot showing how similar the GC frequency was between the DMPs and the background.
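To make the role of the background concrete, the sketch below tests whether DMPs overlap a region set more often than the background would suggest. It assumes a one-sided Fisher's exact test on a 2x2 table, the kind of test LOLA is generally described as using; LOLA itself is an R package, and this Python function, its name, and the toy CpG IDs are illustrative only, not the analysis code.

```python
from math import comb


def enrichment_test(dmps, background, region_sites):
    """One-sided Fisher's exact test for DMPs being enriched in a region set.

    dmps, background and region_sites are sets of CpG IDs; dmps is assumed to
    be a subset of background. Returns (odds_ratio, p_value).
    """
    a = len(dmps & region_sites)                  # DMP, in region
    b = len(dmps - region_sites)                  # DMP, not in region
    c = len((background - dmps) & region_sites)   # non-DMP, in region
    d = len(background - dmps - region_sites)     # non-DMP, not in region
    n_region, n_dmp, n_total = a + c, a + b, a + b + c + d
    # P(X >= a) under the hypergeometric null of no enrichment.
    p_value = sum(
        comb(n_region, k) * comb(n_total - n_region, n_dmp - k)
        for k in range(a, min(n_region, n_dmp) + 1)
    ) / comb(n_total, n_dmp)
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, p_value


# Hypothetical example: 10 background sites, 3 DMPs, 4 sites in the region.
background = {f"cg{i}" for i in range(10)}
dmps = {"cg0", "cg1", "cg2"}
region = {"cg0", "cg1", "cg2", "cg3"}
odds_ratio, p_value = enrichment_test(dmps, background, region)
print(odds_ratio, round(p_value, 4))
```

The result depends entirely on the choice of background, which is why it helps to check that DMPs and background CpGs have similar GC frequency before interpreting any enrichment.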

Figure 19: GC frequency in the DMPs of interest and the background CpGs.

Figure 20: Enrichment of DMPs for 25 chromatin states. Chromatin states across the genome of 127 cell types comprising 25 distinct tissues were available from the Roadmap Epigenomics Project. Using LOLA, the enrichment of DMPs from across all data in The EWAS Catalog for chromatin states was assessed. DMPs were divided into five groups as detailed above. The x-axis shows the 25 chromatin states: TssA, Active TSS; PromU, Promoter Upstream TSS; PromD1, Promoter Downstream TSS with DNase; PromD2, Promoter Downstream TSS; Tx5’, Transcription 5’; Tx, Transcription; Tx3’, Transcription 3’; TxWk, Weak transcription; TxReg, Transcription Regulatory; TxEnh5’, Transcription 5’ Enhancer; TxEnh3’, Transcription 3’ Enhancer; TxEnhW, Transcription Weak Enhancer; EnhA1, Active Enhancer 1; EnhA2, Active Enhancer 2; EnhAF, Active Enhancer Flank; EnhW1, Weak Enhancer 1; EnhW2, Weak Enhancer 2; EnhAc, Enhancer Acetylation Only; DNase, DNase only; ZNF/Rpts, ZNF genes & repeats; Het, Heterochromatin; PromP, Poised Promoter; PromBiv, Bivalent Promoter; ReprPC, Repressed PolyComb; Quies, Quiescent/Low.

Figure 21: Enrichment of DMPs for 167 transcription factor binding sites. Using LOLA, the enrichment of DMPs from across all data in The EWAS Catalog for 167 transcription factor binding sites confirmed across 25 distinct tissues was assessed. DMPs were divided into five groups as detailed above. The x-axis shows the 25 distinct tissues. Not all transcription factor binding sites have been confirmed in every tissue. For some tissues (e.g. “Eye” and “Gingiva”) only five have been confirmed, whereas in blood over 131 have been confirmed.